Submit both the .Rmd and .html files for grading. You may remove the instructions and example problem above, but do not remove the YAML metadata block or the first, “setup” code chunk. Address the steps that appear below and answer all the questions. Be sure to address each question with code and comments as needed. You may use either base R functions or ggplot2 for the visualizations.
Section 0 (15 points)
The following code chunk will:
Do not include package installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download observations.csv from the course site to a known location on your machine. Unless a file.path() is specified, R will look in the directory where this .Rmd is stored when knitting.
## [1] "/Users/lzy23/Downloads"
## 'data.frame': 1500 obs. of 5 variables:
## $ Name : Factor w/ 25 levels "American Diner",..: 6 4 23 14 21 19 21 25 1 1 ...
## $ Distance : num 1.245 2.085 0.567 0.916 3.674 ...
## $ Price_of_Meal : num 11.79 14.4 8.16 8.23 6.08 ...
## $ Price_of_Drink: num 5.08 13.08 8.68 6.9 4.46 ...
## $ Category : Factor w/ 15 levels "American","Brazilian",..: 3 11 14 9 1 1 1 15 1 1 ...
Introduction (15 points)
The data are derived from an observational study of restaurants in the Chicago area. There are 1500 observations in total. Continue by describing your data in detail, including the number of observations, data types, and so on.
### Section 1: (10 points) Summarizing the data.
(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using Category and Status.
## Name Distance Price_of_Meal Price_of_Drink
## Chinese Buffet : 69 Min. : 0.1068 Min. : 1.000 Min. : 1.270
## Italian Pizzeria: 69 1st Qu.: 1.5084 1st Qu.: 4.450 1st Qu.: 5.178
## Steakhouse : 69 Median : 3.0351 Median : 7.965 Median : 8.400
## Korean BBQ : 68 Mean : 3.0503 Mean : 8.051 Mean : 8.605
## Sushi Palace : 68 3rd Qu.: 4.5012 3rd Qu.:11.360 3rd Qu.:12.255
## BBQ Joint : 67 Max. :18.0000 Max. :56.000 Max. :16.000
## (Other) :1090
## Category FoodPrice TT_Price Status
## American :302 Min. : 3.12 Min. : 4.11 Expensive :631
## Mexican :161 1st Qu.:13.75 1st Qu.: 22.21 Inexpensive:172
## Italian :130 Median :18.57 Median : 28.61 Normal :697
## Japanese :129 Mean :18.65 Mean : 29.03
## Thai :122 3rd Qu.:23.00 3rd Qu.: 35.78
## Mediterranean:120 Max. :80.32 Max. :123.89
## (Other) :536
## Ratio
## Min. :0.1382
## 1st Qu.:0.5297
## Median :0.6376
## Mean :0.6642
## 3rd Qu.:0.7813
## Max. :1.0000
##
##
## Expensive Inexpensive Normal
## American 114 37 151
## Brazilian 27 11 27
## Chinese 30 9 30
## French 23 7 24
## German 21 9 24
## Indian 24 9 28
## Italian 61 13 56
## Japanese 49 11 69
## Korean 32 6 30
## Mediterranean 42 25 53
## Mexican 76 10 75
## Peruvian 31 1 24
## Spanish 18 8 27
## Thai 54 9 59
## Vietnamese 29 7 20
Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.
Note that “Name” holds restaurant names as text, which R stores as a factor with 25 levels. “Distance”, “Price_of_Meal”, “Price_of_Drink”, “FoodPrice”, “TT_Price”, and “Ratio” are all continuous numerical variables. “Category” and “Status” are both categorical. For each numerical variable the mean and median are close to each other, which suggests little skew in the bulk of the data. However, several maxima lie far above the third quartile, so the upper tails contain outliers. Using Q1 - 1.5 × IQR and Q3 + 1.5 × IQR as fences, values below -2.98 or above 8.99 are potential outliers of Distance, so the maximum Distance (18.00) is an outlier. Values above 21.73 are potential outliers of Price_of_Meal (the maximum is 56.00). Values above 22.87 would be potential outliers of Price_of_Drink, but the maximum is 16.00, so none are present. Values above 36.88 are potential outliers of FoodPrice (the maximum is 80.32). Values below 1.86 or above 56.14 are potential outliers of TT_Price (the maximum is 123.89). Values below 0.15 or above 1.16 are potential outliers of Ratio; the minimum (0.1382) falls just below the lower fence.
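The fences quoted above can be reproduced directly from the quartiles in the summary() output. The following is a minimal sketch; `tukey_fences` is a hypothetical helper name introduced here for illustration, not a base R function.

```r
# Tukey's rule: flag values more than k * IQR beyond the quartiles.
# `tukey_fences` is a hypothetical helper, not part of base R.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles copied from the summary() output above
distance_fences <- tukey_fences(q1 = 1.5084, q3 = 4.5012)
ratio_fences    <- tukey_fences(q1 = 0.5297, q3 = 0.7813)

round(distance_fences, 2)  # lower -2.98, upper 8.99
round(ratio_fences, 2)     # lower  0.15, upper 1.16
```

With the full data frame available, boxplot.stats(x)$out would list the flagged values themselves, although its hinges can differ slightly from the quartiles that summary() reports.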
(1)(b) (2 points) Generate a table of counts using Category and Status. Add margins to this table. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2, pages 144-147). Lastly, present a barplot of these data, ignoring the marginal totals.
##
## Expensive Inexpensive Normal Sum
## American 114 37 151 302
## Brazilian 27 11 27 65
## Chinese 30 9 30 69
## French 23 7 24 54
## German 21 9 24 54
## Indian 24 9 28 61
## Italian 61 13 56 130
## Japanese 49 11 69 129
## Korean 32 6 30 68
## Mediterranean 42 25 53 120
## Mexican 76 10 75 161
## Peruvian 31 1 24 56
## Spanish 18 8 27 53
## Thai 54 9 59 122
## Vietnamese 29 7 20 56
## Sum 631 172 697 1500
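The steps in (1)(b) might be sketched as follows. The data frame `toy` is a small made-up stand-in (an assumption for illustration), not the course data; with the real data, the same calls would be applied to the Category and Status columns.

```r
# Small made-up stand-in for the real data (assumption, illustration only)
toy <- data.frame(
  Category = factor(c("American", "American", "Thai", "Thai", "Mexican", "American")),
  Status   = factor(c("Expensive", "Normal", "Normal", "Inexpensive", "Expensive", "Normal"))
)

# Two-way table of counts: rows are Category, columns are Status
counts <- table(toy$Category, toy$Status)

# Add row and column totals (labelled "Sum") for presentation
with_margins <- addmargins(counts)
print(with_margins)

# Plot the table WITHOUT margins; transpose so bars are grouped by Category
barplot(t(counts), beside = TRUE, legend.text = TRUE,
        xlab = "Category", ylab = "Number of restaurants",
        main = "Restaurants by Category and Status")
```

Keeping `counts` and `with_margins` as separate objects is what makes it easy to plot without the totals.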
(1)(c) (1 point) Generate a table of counts using only Category. Present a barplot of these data and add a legend to your plot.
##
## American Brazilian Chinese French German
## 302 65 69 54 54
## Indian Italian Japanese Korean Mediterranean
## 61 130 129 68 120
## Mexican Peruvian Spanish Thai Vietnamese
## 161 56 53 122 56
Essay Question (2 points): Discuss the Category distribution of restaurants first. What do you observe? Was that expected? What, if anything, stands out about the distribution of restaurants by Status?
The distribution shows that American restaurants (302) far outnumber every other category. Chinese, Korean, Indian, and the remaining categories have fewer restaurants than the next most common categories: Mexican, Italian, and Japanese. This distribution is expected because American and Mexican cuisines are quite popular in the United States. In most categories, the numbers of expensive and normal restaurants are roughly comparable. However, inexpensive restaurants make up a much smaller proportion of each category than expensive or normal restaurants.
(1)(d) (3 points) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “observations.” Instead, we need to sample from the integers, 1 to 1500, representing the rows of “observations.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using “work”, construct a scatterplot matrix of variables 2-5 with plot(work[, 2:5]) (these are the continuous variables excluding TT_Price and Ratio). The sample “work” will not be used in the remainder of the assignment.
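The row-index sampling described in (1)(d) could look like the sketch below. `observations_toy` is a deterministic placeholder with 1500 rows (an assumption, since the course file is not reproduced here); only the sampling pattern is the point.

```r
# Deterministic placeholder with the same number of rows as the real data
# (assumption: the column names and values here are made up)
observations_toy <- data.frame(
  row_id   = 1:1500,
  Distance = seq(0.1, 15, length.out = 1500)
)

set.seed(123)  # fixes the random draw so the sample is reproducible

# Sample row NUMBERS, not the data frame itself; sample() draws
# without replacement by default, so no row is selected twice
idx  <- sample(seq_len(nrow(observations_toy)), size = 200)
work <- observations_toy[idx, ]

dim(work)
```

Calling set.seed() immediately before sample() matters: any other random call in between would change which rows are drawn.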
### Section 2: (10 points) Summarizing the data using graphics.
(2)(a) (1 point) Use “observations” to plot FoodPrice versus Distance. Color code data points by Status.
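One base R way to color points by Status is to index a named color vector by the factor's labels. This is a sketch on made-up values (the `toy` data frame and the color choices are assumptions), not the assignment's data.

```r
# Made-up values for illustration only
toy <- data.frame(
  Distance  = c(0.5, 1.2, 2.8, 3.1, 4.4, 9.5),
  FoodPrice = c(7.5, 15.0, 24.0, 8.2, 16.1, 60.0),
  Status    = factor(c("Inexpensive", "Normal", "Expensive",
                       "Inexpensive", "Normal", "Expensive"))
)

# Named vector: one color per Status label
status_cols <- c(Expensive = "firebrick", Inexpensive = "steelblue", Normal = "darkgreen")

# Look up each row's color by its Status label
point_cols <- status_cols[as.character(toy$Status)]

plot(toy$Distance, toy$FoodPrice, col = point_cols, pch = 19,
     xlab = "Distance", ylab = "FoodPrice",
     main = "FoodPrice versus Distance by Status")
legend("topleft", legend = names(status_cols), col = status_cols,
       pch = 19, title = "Status")
```

Indexing by label rather than by the factor's integer codes keeps the legend and the points in agreement even if the level order changes.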
Essay Question (3 points): Are there outliers? Do the outliers appear in all Statuses or only in specific ones? Can you explain why these outliers are present?
While the graph doesn’t demonstrate a distinct linear association between Distance and FoodPrice, some data points feature both a longer Distance and a higher FoodPrice. Notably, all of the outliers are labeled as “Expensive.” The appearance of these outliers might be attributed to a couple of factors. Some Chicago restaurants, although located far from Wieboldt Hall, might be enhancing their dining ambiance with unique offerings, such as live music. Furthermore, accolades like a Michelin star can greatly elevate a restaurant’s stature and price. Such restaurants, even if they are remotely located, can command higher prices because of their recognition. Additionally, data collection inaccuracies or errors could contribute to these outliers.
(2)(b) (3 points) Use “observations” to research the variables TT_Price, FoodPrice, and Distance. Plot histograms for all three variables and identify the distribution each follows. Discuss the skewness of each. To understand FoodPrice further, examine its components, Price_of_Meal and Price_of_Drink. Use histograms and boxplots for this free-form analysis, which will help you form conclusions.
Essay Question (3 points): Do all of the variables, or any specific ones, follow a normal distribution? Are any of the simple variables skewed, and how do they affect the distribution of FoodPrice and/or TT_Price? How does their variability compare?
Price_of_Drink has an approximately symmetric distribution, which is consistent with normality, although symmetry alone does not establish it. Distance is positively skewed, indicating that most restaurants are relatively close. Similarly, the histogram of Price_of_Meal shows positive skewness, suggesting that a few restaurants charge premium prices. However, if we ignore the outliers and look at the boxplot, Price_of_Meal appears approximately normal. The skewness in Price_of_Meal influences the skewness observed in FoodPrice, since the meal price is a major part of the total food price. The skewness in Distance also affects TT_Price, since transportation costs are added, especially for restaurants that are farther away. Since TT_Price combines FoodPrice and Distance, it has the highest variability, as expected. The histograms show moderate variability in meal and drink prices across restaurants.
### Section 3: (15 points) Getting insights about the data using graphs.
(3)(a) (5 points) Use “observations” to create a multi-figured plot with histograms, boxplots and Q-Q plots of Ratio differentiated by Status. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
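A base R layout for (3)(a) might be sketched as below. The `toy` data frame is randomly generated (an assumption for illustration), so the shapes of the plots carry no meaning; only the 3-by-3 arrangement is the point.

```r
# Randomly generated stand-in data (assumption, illustration only)
set.seed(42)
toy <- data.frame(
  Status = factor(rep(c("Expensive", "Inexpensive", "Normal"), each = 50)),
  Ratio  = runif(150, min = 0.1, max = 1)
)

# One numeric vector of Ratio values per Status level
groups <- split(toy$Ratio, toy$Status)

# 3 x 3 grid, filled row by row; save old settings to restore afterwards
old_par <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))

for (s in names(groups)) {          # row 1: histograms
  hist(groups[[s]], main = paste("Histogram:", s), xlab = "Ratio")
}
for (s in names(groups)) {          # row 2: boxplots
  boxplot(groups[[s]], main = paste("Boxplot:", s), ylab = "Ratio")
}
for (s in names(groups)) {          # row 3: normal Q-Q plots
  qqnorm(groups[[s]], main = paste("Q-Q plot:", s))
  qqline(groups[[s]])               # reference line through the quartiles
}

par(old_par)
```

Because mfrow fills the grid row by row, running the three loops in this order puts each plot type on its own row and each Status in its own column.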
(3)(b) (5 points) Repeat the process from step (3)(a), but now replace Status with Category.
Essay Question (5 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed to evaluate non-normality.
Status vs Ratio: In the histograms of Ratio by Status, the distributions for “Expensive”, “Inexpensive”, and “Normal” are roughly bell-shaped but not symmetric. All three are right-skewed, and their right tails suggest potential outliers. In the boxplots, “Normal” and “Inexpensive” are also asymmetric, whereas the “Expensive” boxplot is more symmetric and closer to normality. Finally, in the Q-Q plots, only the central portions of “Expensive” and “Normal” follow the reference line. All three statuses show clear deviations from the reference line, most pronounced in the upper tail, which indicates that the distributions are not normal. Hence, comparing the displays, the Ratio distributions for all three statuses deviate from normality.
Category vs Ratio: In the histograms of Ratio by Category, most categories, such as “American” and “Thai”, have right-skewed distributions. In my opinion, “Japanese” is the only category that is close to a normal distribution, even though its plot still shows some potential outliers. The other categories are not close to a symmetric, bell-shaped distribution. In the boxplots, “French”, “Mediterranean”, and “Japanese” are closer to normality, since their boxplots are more symmetric than those of the other categories. Finally, in the Q-Q plots, most points for “French”, “Mediterranean”, and “Japanese” follow the reference line, although the upper tails of these three categories still diverge from it. All other categories show some deviation from the reference line. More specifically, Italian and German restaurants deviate more in the upper tail, while Spanish and Vietnamese restaurants deviate more in the lower tail than other categories. These deviations from the reference line in the upper and lower tails indicate non-normality.
### Section 4: (10 points) Getting insights about possible predictors.
(4)(a) (5 points) Now we want to focus on the most common categories. Find the top 7 Categories by number of restaurants. Run summary() and histograms to identify and compare them. Create boxplots of TT_Price by these 7 categories. To do this, you may want to identify the categories first and then index the observations to get a working subset containing only these categories. Also run boxplots for both TT_Price and Distance by common Category.
## Name Distance Price_of_Meal Price_of_Drink
## Chinese Buffet : 69 Min. : 0.1068 Min. : 1.000 Min. : 1.280
## Italian Pizzeria : 69 1st Qu.: 1.5243 1st Qu.: 4.580 1st Qu.: 5.040
## Steakhouse : 69 Median : 3.0248 Median : 8.000 Median : 8.310
## Sushi Palace : 68 Mean : 3.0487 Mean : 8.142 Mean : 8.464
## BBQ Joint : 67 3rd Qu.: 4.4861 3rd Qu.:11.370 3rd Qu.:12.090
## Thai Noodle House: 66 Max. :18.0000 Max. :56.000 Max. :16.000
## (Other) :625
## Category FoodPrice TT_Price Status
## American :302 Min. : 3.12 Min. : 4.38 Expensive :426
## Chinese : 69 1st Qu.:13.56 1st Qu.: 22.53 Inexpensive:114
## Italian :130 Median :18.37 Median : 28.62 Normal :493
## Japanese :129 Mean :18.60 Mean : 28.95
## Mediterranean:120 3rd Qu.:22.98 3rd Qu.: 35.43
## Mexican :161 Max. :80.32 Max. :123.89
## Thai :122
## Ratio
## Min. :0.1382
## 1st Qu.:0.5242
## Median :0.6402
## Mean :0.6655
## 3rd Qu.:0.7823
## Max. :1.0000
##
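Identifying the seven most common categories and subsetting to them could be done as in this sketch. The category counts are copied from the frequency table in (1)(c); the data frame `toy` is rebuilt from those counts alone and has no other columns (an assumption made so the example is self-contained).

```r
# Category counts copied from the frequency table in (1)(c)
category_counts <- c(
  American = 302, Brazilian = 65, Chinese = 69, French = 54, German = 54,
  Indian = 61, Italian = 130, Japanese = 129, Korean = 68,
  Mediterranean = 120, Mexican = 161, Peruvian = 56, Spanish = 53,
  Thai = 122, Vietnamese = 56
)

# Rebuild a one-column data frame with one row per restaurant observation
toy <- data.frame(
  Category = factor(rep(names(category_counts), times = category_counts))
)

# Rank categories by frequency and keep the names of the top 7
ranked <- sort(table(toy$Category), decreasing = TRUE)
top7   <- names(ranked)[1:7]

# Keep only rows in those categories; droplevels() removes the 8 unused
# factor levels so they do not appear as empty groups in later plots
common <- droplevels(toy[toy$Category %in% top7, , drop = FALSE])

nrow(common)              # 1033
levels(common$Category)
```

Without droplevels(), boxplot() and table() would still reserve a slot for each of the eight excluded categories.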
Essay Question (5 points) Do you believe that these categories represent prices and distances of all restaurants well, or are there biases in their values?
To determine whether the top 7 restaurant categories are representative of all categories, I placed the histograms and boxplots of Total Price and Distance for the top 7 categories and for all categories side by side above.
The median distance and TT_Price for each category align closely with the mean values of the entire dataset. The shape of the histogram for the top 7 categories is close to that of the histogram representing all categories. In addition, the subset of the top 7 categories contains 1,033 data points, about 69% of the original dataset. This suggests that these categories can provide insight into the pricing and distances of all restaurants. However, the boxplots of Total Price for the top 7 categories appear largely symmetric and close to normal, which differs from the distribution seen in the entire dataset. The same holds for the Distance boxplots: most of the top 7 categories show symmetry and approximate normality, with the exceptions of “Italian” and “Japanese”, again in contrast to the distribution of the full dataset. In general, while the top 7 categories present data that are largely symmetric and approximately normal, they may not be wholly representative of the entire spectrum of restaurants. This is mainly because the eight categories excluded from this subset still influence the overall distribution of the restaurant data.
### Section 5: (10 points) Getting insights regarding different groups in the data.
To do this, we now limit our analysis to only the most common categories of restaurants.
(5)(a) (2 points) Use aggregate() with “observations” to compute the mean values of FoodPrice, Total Price (TT_Price) and Ratio for each combination of Category and Status. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by Status and columns by Category. The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.
Mean FoodPrice:

| | American | Chinese | Italian | Japanese | Mediterranean | Mexican | Thai |
|---|---|---|---|---|---|---|---|
| Expensive | 25.926579 | 24.668667 | 24.690328 | 26.157551 | 24.25095 | 24.83145 | 24.674815 |
| Inexpensive | 7.694324 | 7.257778 | 8.042308 | 8.597273 | 6.77480 | 7.99600 | 6.354444 |
| Normal | 15.683642 | 15.493333 | 15.308036 | 15.727536 | 15.37377 | 15.19427 | 15.339322 |

Mean TT_Price:

| | American | Chinese | Italian | Japanese | Mediterranean | Mexican | Thai |
|---|---|---|---|---|---|---|---|
| Expensive | 36.75395 | 35.98067 | 36.03410 | 37.55531 | 34.41643 | 35.06276 | 34.64519 |
| Inexpensive | 16.80027 | 20.94778 | 18.69538 | 18.15636 | 16.41920 | 18.76600 | 11.54000 |
| Normal | 25.50669 | 24.98667 | 25.84411 | 25.72986 | 25.72075 | 25.17773 | 26.84288 |

Mean Ratio:

| | Chinese | Mexican | Thai | American | Italian | Mediterranean | Japanese |
|---|---|---|---|---|---|---|---|
| Normal | 0.7301425 | 0.7096989 | 0.7053911 | 0.7248966 | 0.7356169 | 0.7295717 | 0.7337815 |
| Expensive | 0.5759378 | 0.3655736 | 0.5308534 | 0.5471574 | 0.5089807 | 0.5364901 | 0.6308457 |
| Inexpensive | 0.6563062 | 0.6615544 | 0.6388866 | 0.6556623 | 0.6330138 | 0.6435776 | 0.5982316 |
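When building such matrices with aggregate() and matrix(), the fill order has to match the order of the aggregated rows, otherwise the row and column labels end up attached to the wrong values. The sketch below uses made-up data (the `toy` data frame is an assumption) and cross-checks the result against tapply(), which labels its output automatically.

```r
# Made-up data with every Category x Status combination present
set.seed(1)
toy <- expand.grid(
  Category = c("American", "Mexican", "Thai"),
  Status   = c("Expensive", "Inexpensive", "Normal"),
  rep      = 1:4,
  stringsAsFactors = TRUE
)
toy$FoodPrice <- round(runif(nrow(toy), min = 5, max = 30), 2)

# In the aggregate() result the FIRST grouping variable (Category)
# varies fastest, so the rows run category by category within each Status
agg <- aggregate(FoodPrice ~ Category + Status, data = toy, FUN = mean)

# Fill by row so that each matrix row is one Status across all Categories
mat <- matrix(agg$FoodPrice,
              nrow = nlevels(toy$Status), byrow = TRUE,
              dimnames = list(levels(toy$Status), levels(toy$Category)))

# Independent check: tapply() attaches the labels itself
check <- tapply(toy$FoodPrice, list(toy$Status, toy$Category), mean)
all.equal(mat, check, check.attributes = FALSE)

knitr_available <- requireNamespace("knitr", quietly = TRUE)
if (knitr_available) print(knitr::kable(mat))
```

This approach assumes no combination is missing; if a Category and Status pair had zero rows, aggregate() would drop it and the matrix fill would shift.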
(5)(b) (4 points) Present three graphs. Each graph should include three lines, one for each Category. The first should show mean Ratio versus Status; the second, mean Total Price versus Status; the third, mean Food Price versus Status. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
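A base R version of one of these graphs could be sketched with interaction.plot(). The means below are copied (rounded) from the mean FoodPrice table for three of the seven categories; restricting to three categories and ordering Status from cheapest to most expensive are choices made here for illustration.

```r
# Rounded mean FoodPrice values copied from the table in (5)(a),
# for three of the seven categories only
means <- data.frame(
  Category  = rep(c("American", "Chinese", "Thai"), times = 3),
  Status    = rep(c("Expensive", "Inexpensive", "Normal"), each = 3),
  FoodPrice = c(25.93, 24.67, 24.67,   # Expensive
                 7.69,  7.26,  6.35,   # Inexpensive
                15.68, 15.49, 15.34)   # Normal
)

# Order the x axis from cheapest to most expensive Status
means$Status   <- factor(means$Status,
                         levels = c("Inexpensive", "Normal", "Expensive"))
means$Category <- factor(means$Category)

# One line per Category, Status on the x axis
interaction.plot(x.factor = means$Status,
                 trace.factor = means$Category,
                 response = means$FoodPrice,
                 fun = mean, type = "b", pch = 1:3,
                 xlab = "Status", ylab = "Mean FoodPrice",
                 trace.label = "Category",
                 main = "Mean FoodPrice versus Status")
```

The same call, with `response` swapped for the mean TT_Price or mean Ratio, would give the other two graphs.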
Essay Question (2 points): What questions do these plots raise? Consider distance and category of restaurant.
The plot of mean Food Price against Status reveals a consistent trend: within each status, the average food price is relatively uniform across restaurant categories. However, when we look at Total Price and Ratio, discrepancies become evident among the seven most common categories. It is important to note that Total Price factors in Distance, which significantly affects both the total price and the ratio, particularly for the inexpensive restaurants within each category. A deeper exploration of how distance influences the Ratio and Total Price for inexpensive options is needed. For instance, the Ratio for inexpensive Thai restaurants surpasses that of their normally priced counterparts. This suggests that inexpensive Thai restaurants are generally closer than the mid-tier Thai establishments.
Essay Question (2 points): What do these displays suggest about eating out expenses? Also, compare the ratio for each of the different categories. Do they give an insight about locations?
The visual representations indicate that within a given status, food prices across different restaurant categories are relatively consistent. Yet a closer look at the graph reveals that within each restaurant category, the Ratio for the “expensive” status surpasses that of the other two statuses. This is mainly because the food cost is significantly higher than travel expenses for these upscale dining options. A notable observation is that while the seven top restaurant categories display comparable proportions in both “expensive” and “normal” statuses, distinctions emerge within the “inexpensive” status. For example, while the food prices of inexpensive Chinese and Thai restaurants are similar, their Ratios differ significantly. The lower Ratio for Chinese restaurants suggests they are typically located farther away than other categories, making them less accessible for those prioritizing proximity. As mentioned in the previous discussion, the Ratio for inexpensive Thai restaurants is higher than that for normally priced Thai restaurants. This implies that Thai restaurants offer not only affordable food but also the convenience of shorter travel distances. Thus, for those looking to minimize both food and transportation expenses, Thai food seems to be an economical choice.
### Section 6: (15 points) Conclusions from the Exploratory Data Analysis (EDA).
Conclusions
Essay Question (5 points) Based solely on these data, what are plausible statistical reasons that explain the expenditure of Chicagoans who reside downtown on restaurant eating?
1. General Pricing Overview: The barplot showcasing the relationship between Category and Status highlights a clear predominance of “Expensive” and “Normal” priced restaurants. This indicates that dining out in downtown Chicago tends to be on the pricier side.
2. Availability of Choices: Most of the observed restaurants lie within about five miles of the downtown area (the third quartile of Distance is 4.50). While there are some inexpensive dining options in downtown Chicago, the variety is limited. Compared to higher-priced options, residents have fewer inexpensive restaurants to choose from.
3. Outliers in Pricing and Distance: The scatterplot of FoodPrice against Distance identified certain expensive restaurants as outliers. These outliers are characterized by either exorbitant food prices or their considerable distance from downtown. For downtown residents who opt for these outliers, the cost of a meal can exceed one hundred dollars.
4. Analysis of Top Seven Categories: The boxplots for Total Price and Distance, segmented by the top seven restaurant categories, suggest uniformity across the board. On average, the price and distance associated with these top categories are quite comparable. Furthermore, when examining the Mean Total Price Vs Status, the total price differential between the “expensive” and “normal” categories isn’t stark for downtown Chicago residents.
5. Financial Considerations for Inexpensive Dining: For those prioritizing cost and convenience, Thai restaurants stand out as a desirable option. The mean total price for Thai food is lower relative to other categories. Moreover, the higher Ratio indicates that choosing Thai food doesn’t necessitate long commutes or significant transportation costs. In essence, Thai restaurants offer both affordability and proximity for downtown Chicagoans.
Essay Question (15 points) Summarize your findings in a half-page paragraph
First of all, the barplot of Category and Status suggests that Chicago residents may favor American and Mexican restaurants, since these two categories are the most numerous in Chicago. Of course, we would need to survey Chicago residents for preference data to make further inferences. The predominance of “Expensive” and “Normal” priced restaurants in the barplot suggests that dining out in downtown Chicago is generally costly. The scatterplot of FoodPrice vs Distance shows no distinct relationship between distance and the price of the food itself; in other words, Chicagoans can find expensive, normal-priced, and inexpensive restaurants at the same distance. The histogram of Price_of_Drink is close to uniform, which means that drink prices are about the same across every category and status of restaurant in the Chicago area and do not significantly affect the total price. Nevertheless, the skewness in Price_of_Meal influences the skewness observed in FoodPrice, since the meal price is a major part of the total food price. The skewness in Distance also affects the Total Price, since transportation costs are added, especially for restaurants that are farther away.
The data present intriguing distributions of Ratio across the statuses “Expensive,” “Inexpensive,” and “Normal”. When we analyze these distributions through histograms and boxplots, we observe that most of them are right-skewed, so they are not perfectly symmetric. Further, the right tails in these plots point to potential outliers in the data. The deviations from the reference line in the Q-Q plots for all three statuses, especially noticeable in the upper tail, indicate that the Ratio distributions for these statuses are far from normal. When the data are broken down by restaurant category, certain patterns emerge. Categories such as “American” and “Thai” both display a right-skewed distribution in the histograms. In contrast, the “Japanese” category seems to come closest to a normal distribution, though outliers remain possible. The boxplots and Q-Q plots reveal further nuances. “French,” “Mediterranean,” and “Japanese” appear closer to a normal distribution than the other categories. However, their upper tails show deviations, indicating non-normality. There are also distinctive variations in other categories. For instance, both “Italian” and “German” display non-normal distributions, as indicated by their respective boxplots and Q-Q plots. It is noteworthy that the top 7 restaurant categories present distributions that tend towards symmetry and normality for total price and distance. However, drawing overarching conclusions from these top categories is risky because they might not be wholly representative of all restaurant categories. This limitation arises from the potential differences in the distribution of the entire dataset and the omission of eight other categories.
Finally, even though food prices are relatively consistent across different categories within a specific status, the Ratio for “expensive” restaurants is higher because their food prices are large relative to their transportation costs. Note that while many categories exhibit similar proportions in both “expensive” and “normal” statuses, disparities arise in the “inexpensive” status. Here, inexpensive Chinese restaurants tend to be located farther away, making Thai restaurants a more cost-effective choice when considering both food and transportation costs.
All in all, dining trends in Chicago reveal a predominance of expensive and normal-priced restaurant options, particularly in the downtown area, suggesting a general costliness to dining out. Most restaurant categories exhibit similar distributions in terms of price and distance, with notable exceptions being the distinctions in inexpensive statuses across categories. Despite the variety available within five miles of downtown Chicago, residents have fewer inexpensive dining choices and must consider both the cost of meals and transportation. The data emphasizes the nuances in dining costs across categories, hinting at potential dining strategies for those conscious of their spending.